Introduction: Model Fitting

library(tidyverse)
library(tidymodels)
library(ggplot2)
library(corrplot)
library(ggthemes)
library(kableExtra)
tidymodels_prefer()

set.seed(3435)

diamonds_split <- initial_split(diamonds, prop = 0.80,
                                strata = price)
diamonds_train <- training(diamonds_split)
diamonds_test <- testing(diamonds_split)

Creating a Recipe

You’ll notice that the textbooks use the lm() function to fit a linear regression. The tidymodels framework, however, has its own structure and flow, which is designed to work with multiple different machine learning models and packages seamlessly.
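For comparison, the base-R approach from the textbooks would be a single call to lm(). A minimal sketch (fitting to the full diamonds data for illustration, rather than the training split we just created):

```r
# Load ggplot2 for the diamonds data set
library(ggplot2)

# Fit a linear regression of price on all other columns
base_fit <- lm(price ~ ., data = diamonds)

# Inspect the first few estimated coefficients
coef(base_fit)[1:5]
```

The tidymodels approach below produces the same kind of model, but separates preprocessing from fitting so the same recipe can be reused across engines.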

To fit any model with tidymodels, the first step is to create a recipe. The structure of this recipe is similar to that of lm(); the outcome is listed first, then the features are added:

simple_diamonds_recipe <-
  recipe(price ~ ., data = diamonds_train)

Note that . is a placeholder for “all other variables.” If we call the recipe object now, we can see some information:

simple_diamonds_recipe
## Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          9

More specifically, we see that there are 9 predictors.

We should dummy-code all categorical predictors. We can do that easily with step functions:

diamonds_recipe <- recipe(price ~ ., data = diamonds_train) %>% 
  step_dummy(all_nominal_predictors())

Note that we haven’t specified what type of model we’ll be fitting yet. The other beauty of the recipe is that it can then be directly given to one of many machine learning model “engines.”

Running the above code is essentially like writing down the instructions for a recipe on a sheet of paper. We’ve prepared the recipe to give to the workflow, but we are probably interested in seeing what the results of the recipe itself actually look like. Did the dummy coding work, for example? To apply the recipe to a data set and view the results, we can use prep(), which is akin to setting out a mise en place of ingredients, and bake().

We’ll also use kable() and kable_styling() from the kableExtra package, which we loaded above. These functions aren’t strictly necessary, but they display the table of data more neatly, so that all the columns and rows are actually legible. We also use head() to print only the first few rows; otherwise the entire data frame would print, which would be time-consuming. Finally, note the use of scroll_box(), which lets us scroll through the entire table.

prep(diamonds_recipe) %>% 
  bake(new_data = diamonds_train) %>% 
  head() %>% 
  kable() %>% 
  kable_styling(full_width = F) %>% 
  scroll_box(width = "100%", height = "200px")
(Output: the first six rows of the baked training data in a scrollable table – the numeric predictors carat, depth, table, x, y, and z, the outcome price, and the dummy columns cut_1 through cut_4, color_1 through color_6, and clarity_1 through clarity_7.)

Note that the dummy columns contain fractional values rather than 0s and 1s. This is because cut, color, and clarity are ordered factors in the diamonds data, so step_dummy() encodes them with polynomial contrasts by default.

Activities:

  • Use the Internet to find documentation about the possible step functions. Name three step functions that weren’t used here and describe what they do.

Linear Regression

Next, we can specify the model engine that we want to fit:

lm_model <- linear_reg() %>% 
  set_engine("lm")

We set up a workflow. This step might seem unnecessary now, with only one model and one recipe, but it can make life easier when you are trying out a series of models or several different recipes later on.

lm_wflow <- workflow() %>% 
  add_model(lm_model) %>% 
  add_recipe(diamonds_recipe)

Finally, we can fit the linear model to the training set:

lm_fit <- fit(lm_wflow, diamonds_train)

We can view the model results:

lm_fit %>% 
  # This returns the parsnip object:
  extract_fit_parsnip() %>% 
  # Now tidy the linear model object:
  tidy()
## # A tibble: 24 × 5
##    term        estimate std.error statistic   p.value
##    <chr>          <dbl>     <dbl>     <dbl>     <dbl>
##  1 (Intercept)  5317.      496.      10.7   8.54e- 27
##  2 carat       11282.       54.3    208.    0        
##  3 depth         -58.4       6.25    -9.33  1.07e- 20
##  4 table         -24.1       3.23    -7.45  9.40e- 14
##  5 x            -970.       49.0    -19.8   5.16e- 87
##  6 y               9.08     19.9      0.456 6.49e-  1
##  7 z            -127.       73.2     -1.74  8.23e-  2
##  8 cut_1         596.       25.0     23.8   1.22e-124
##  9 cut_2        -300.       19.9    -15.0   4.77e- 51
## 10 cut_3         153.       17.2      8.90  5.99e- 19
## # … with 14 more rows

Activities:

  • Explain what the intercept represents.

  • Describe the effect of carat. Is it a significant predictor of price? Holding everything else constant, what is the effect on price of a one-unit increase in carat?
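If you want to isolate a single row of the coefficient table – say, the carat row for the activity above – one option is to filter the tidied output (a sketch reusing the lm_fit object from above):

```r
lm_fit %>% 
  extract_fit_parsnip() %>% 
  tidy() %>% 
  # keep only the coefficient for carat
  filter(term == "carat")
```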

The following code generates predicted values for price for each observation in the training set:

diamond_train_res <- predict(lm_fit, new_data = diamonds_train %>% select(-price))
diamond_train_res %>% 
  head()
## # A tibble: 6 × 1
##    .pred
##    <dbl>
## 1 -1363.
## 2  -667.
## 3   199.
## 4  -815.
## 5 -3472.
## 6 -1373.

Now we attach a column containing the actual observed values of price:

diamond_train_res <- bind_cols(diamond_train_res, diamonds_train %>% select(price))
diamond_train_res %>% 
  head()
## # A tibble: 6 × 2
##    .pred price
##    <dbl> <int>
## 1 -1363.   326
## 2  -667.   326
## 3   199.   327
## 4  -815.   334
## 5 -3472.   335
## 6 -1373.   336

We might be interested in a plot of predicted values vs. actual values, here for the training data:

diamond_train_res %>% 
  ggplot(aes(x = .pred, y = price)) +
  geom_point(alpha = 0.2) +
  geom_abline(lty = 2) + 
  theme_bw() +
  coord_obs_pred()

It’s fairly clear that the model didn’t do very well. If it predicted every observation perfectly, the points would fall along the dashed diagonal line. The model also predicts negative values of price for some observations, and once the actual price exceeds roughly \(\$5,000\), the model does a pretty poor job.

The odds are that a linear model is simply not the best tool for this machine learning task. It is likely not an accurate representation of f(); remember that by using a linear regression, we are imposing a specific form on the function, rather than learning the function from the data.
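Concretely, a linear regression assumes that the relationship has the form

\[
f(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p
\]

so no matter what the true relationship between price and the predictors looks like, the fitted model can only ever produce predictions that are a weighted sum of the features.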

In future labs, we’ll try out different models and compare them. Finally, we can calculate the training root mean squared error (RMSE) and the testing RMSE.

diamond_test_res <- predict(lm_fit, new_data = diamonds_test %>% select(-price))
diamond_test_res <- bind_cols(diamond_test_res, diamonds_test %>% select(price))
  
rmse(diamond_train_res, truth = price, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       1126.
rmse(diamond_test_res, truth = price, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       1146.
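Under the hood, RMSE is just \(\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}\). As a sanity check, here it is computed by hand on a small toy vector (self-contained; these numbers are made up and unrelated to the diamonds results):

```r
# Toy observed and predicted values
actual    <- c(100, 200, 300)
predicted <- c(110, 190, 310)

# Root mean squared error, step by step
errors <- actual - predicted        # -10, 10, -10
rmse_by_hand <- sqrt(mean(errors^2))
rmse_by_hand                        # 10
```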

We can create and view a “metric set” of RMSE, \(R^2\), and MAE (mean absolute error) as shown:

diamond_metrics <- metric_set(rmse, rsq, mae)
diamond_metrics(diamond_train_res, truth = price, 
                estimate = .pred)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard    1126.   
## 2 rsq     standard       0.920
## 3 mae     standard     736.
diamond_metrics(diamond_test_res, truth = price,
                estimate = .pred)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard    1146.   
## 2 rsq     standard       0.918
## 3 mae     standard     751.

Activities:

  • Is there a difference between the three metrics for the training data and the testing data?

  • Do your best to explain why this difference does (or does not) exist.

k-Nearest Neighbors

Now we’ll take the recipe we’ve already created and try fitting a KNN (k-nearest neighbors) model with it! To do this, we’ll use the nearest_neighbor() function, rather than the linear_reg() function, but we’ll still need to select an engine. There is only one R package, or engine, that works with nearest_neighbor(), and that is the kknn package.

To use an engine, we must make sure that the corresponding package is installed and loaded on our machine. The code to do that is below – however, remember that you must keep the install.packages() line commented out to successfully knit this file.

# install.packages("kknn")
library(kknn)

knn_model <- nearest_neighbor() %>% 
  set_engine("kknn")

Unlike a linear regression, however, there is a parameter – more specifically, a hyperparameter – that we have to set to fit a k-nearest neighbors model. That hyperparameter is k – the number of neighbors for the model to consider.

How do we know the “right” or the optimal value of k to use? Well, we don’t! Eventually we’ll discuss the concepts of resampling, cross-validation, and tuning, which will allow us to determine the optimal value of a hyperparameter with relative ease. For now, we’ll go with the default value of k.
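For reference, parsnip exposes this hyperparameter as the neighbors argument, so we could set a value ourselves instead of relying on the default (a sketch; 10 here is an arbitrary illustrative choice, not a recommendation):

```r
knn_model_k10 <- nearest_neighbor(neighbors = 10) %>% 
  set_engine("kknn")
```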

Activities:

  • What IS the default value of k here, if we do not specify a value? How did you find out?

Now we add the model and recipe to the workflow and fit the model to the training data set:

knn_wflow <- workflow() %>% 
  add_model(knn_model) %>% 
  add_recipe(diamonds_recipe)

knn_fit <- fit(knn_wflow, diamonds_train)

Trying to view the results, as we did with the linear regression, is not very informative:

knn_fit %>% 
  extract_fit_parsnip()
## parsnip model object
## 
## 
## Call:
## kknn::train.kknn(formula = ..y ~ ., data = data, ks = min_rows(5,     data, 5))
## 
## Type of response variable: continuous
## minimal mean absolute error: 396.2295
## Minimal mean squared error: 619090.6
## Best kernel: optimal
## Best k: 5

It tells us that the “best” value of k is 5, but that is also because it’s the only value we tried, so it doesn’t mean much.
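Until then, a crude way to experiment with other values of k is to swap a modified model spec into the existing workflow and refit (a sketch reusing knn_wflow from above; update_model() replaces the model while keeping the recipe):

```r
# Same model spec, but with 20 neighbors instead of the default
knn_model_k20 <- nearest_neighbor(neighbors = 20) %>% 
  set_engine("kknn")

knn_fit_k20 <- knn_wflow %>% 
  update_model(knn_model_k20) %>% 
  fit(diamonds_train)
```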

We’ll generate the predictions from this model for the training set and testing set, and then compare the metrics for each, as before:

diamond_train_knn <- predict(knn_fit, new_data = diamonds_train %>% select(-price))
diamond_train_knn <- bind_cols(diamond_train_knn, diamonds_train %>% select(price))

diamond_test_knn <- predict(knn_fit, new_data = diamonds_test %>% select(-price))
diamond_test_knn <- bind_cols(diamond_test_knn, diamonds_test %>% select(price))

diamond_metrics(diamond_train_knn, truth = price, 
                estimate = .pred)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard     397.   
## 2 rsq     standard       0.990
## 3 mae     standard     200.
diamond_metrics(diamond_test_knn, truth = price,
                estimate = .pred)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard     779.   
## 2 rsq     standard       0.962
## 3 mae     standard     393.

Activities:

  • Which of the two models – linear regression or k-nearest neighbors – performed better on the testing data?

  • What do you think explains the difference between the training and testing metrics for the KNN model?

Resources

The free book Tidy Modeling with R is strongly recommended.

You can view all the ISLR textbook code written with tidymodels here.